{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cd4fd0c8",
   "metadata": {},
   "source": [
    "# Tutorial: Prepare your own documents for vector search\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fa95f9a3",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install langchain pypdf langchain-openai --quiet"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "01e963f3",
   "metadata": {},
   "source": [
    "## 1. Upload your documents\n",
    "First, remove the existing files in the `/docs` folder and add your own PDF files. Then, run the cells below"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dc06127b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a loader that processes all files in the docs directory\n",
    "import os\n",
    "from langchain_community.document_loaders import PyPDFLoader\n",
    "\n",
    "# Path to docs directory\n",
    "docs_dir = \"./docs\"\n",
    "\n",
    "# Get all files in the directory\n",
    "all_files = [os.path.join(docs_dir, f) for f in os.listdir(docs_dir) \n",
    "             if os.path.isfile(os.path.join(docs_dir, f))]\n",
    "\n",
    "# Process each file in the directory\n",
    "documents = []\n",
    "for file_path in all_files:\n",
    "    try:\n",
    "        loader = PyPDFLoader(\n",
    "            file_path=file_path,\n",
    "        )\n",
    "        docs = loader.load()\n",
    "        documents.extend(docs)\n",
    "        print(f\"Loaded {len(docs)} chunks from {file_path}\")\n",
    "    except Exception as e:\n",
    "        print(f\"Error loading {file_path}: {e}\")\n",
    "\n",
    "print(f\"Loaded total of {len(documents)} document chunks\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aacb83c0",
   "metadata": {},
   "source": [
    "## 2. Chunk documents\n",
    "Split large documents into smaller chunks for better embedding quality."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "63b66500",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "\n",
    "# Split documents into chunks of 1000 characters with 200 characters overlap\n",
    "text_splitter = RecursiveCharacterTextSplitter(\n",
    "    chunk_size=1000,\n",
    "    chunk_overlap=200\n",
    ")\n",
    "\n",
    "chunks = text_splitter.split_documents(documents)\n",
    "print(f\"Created {len(chunks)} chunks\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "263be846",
   "metadata": {},
   "outputs": [],
   "source": [
    "# print chunks\n",
    "for chunk in chunks:\n",
    "    print(chunk.page_content)\n",
    "    print(\"-\"*100)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1e96b64b",
   "metadata": {},
   "source": [
    "## 3. Generate embeddings\n",
    "Use OpenAI embeddings to encode each chunk into a vector."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "01ae89f4",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_openai import OpenAIEmbeddings\n",
    "# define embeddings as default OpenAI embeddings\n",
    "embeddings = OpenAIEmbeddings()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "37b87e5f",
   "metadata": {},
   "source": [
    "## 4. Store embeddings in Chroma\n",
    "Initialize a Chroma vector store and persist it locally.\n",
    "\n",
    "If you run into a  \"OperationalError: attempt to write a readonly database\" - restart the kernel and rerun the notebook.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2901d768",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.vectorstores import Chroma\n",
    "\n",
    "# create vector store with Chroma\n",
    "vectordb = Chroma.from_documents(\n",
    "    chunks,\n",
    "    embedding=embeddings,\n",
    "    persist_directory=\"db\",\n",
    "    collection_name=\"my_custom_index\"\n",
    ")\n",
    "vectordb.persist()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4700d544",
   "metadata": {},
   "source": [
    "## 5. Example similarity search\n",
    "Perform a similarity search query on your vector store."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "207b9599",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Test similarity search\n",
    "query = \"robotics\"\n",
    "results = vectordb.similarity_search(query, k=5)\n",
    "for i, doc in enumerate(results):\n",
    "    print(f\"Result {i+1}: {doc.page_content}...\\n\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
